1. Summary of proposals

The document includes proposals for the following inventions: 

Programmable 40 Gbits/s Traffic Handler - A traffic handler architecture in which packets are processed by
software and inserted into an orderlist for scheduling.

Packet storage system for traffic handling - A Memory Hub, used for buffering packets in a high line rate Traffic
Handler.

State Element - A smart memory cell for serialising accesses to shared state variables.

State Engine - A formal framework for designing an active state storage system using state elements.

Programmable orderlist manager - A system for maintaining ordered logical data structures in software at high
speeds.

Overlapped Virtual Queueing - A low overhead method for setting up and tearing down virtual queues.



Notes: 

• Figure and reference indexes apply locally within each chapter of this document.

• This is a summary document. There will be a lot of additional detail (functions and claims) relating to each
proposal which may not be covered in this document.

• In each proposal, additional related design work is also listed. These are ideas which are to be considered
to have potential either as sub-patents, or maybe as independent proposals in their own right.
2. A programmable 40 Gbits/s Traffic Handler



2.1 Background 

A basic knowledge of the function and anatomy of an internet router is assumed.

A router's switch fabric can deliver packets from multiple ingress ports to one of a number of egress ports. The
linecard connected to this egress port must then transmit these packets over some communication medium to the
next router in the network. The rate of transmission is normally limited to a standard rate. For instance, an OC-768
link would transmit packets over an optical fibre at a rate of 40 Gbits/s.

With many independent ingress paths delivering packets for transmission at egress, the time averaged rate of
delivery cannot exceed 40 Gbits/s for this example. Although over time the input and output rates are equivalent, the
short term delivery of traffic by the fabric is bursty in nature, with rates often peaking above the 40 Gbits/s threshold.
Since the rate of receipt can be greater than the rate of transmission, short term packet queueing is required at
egress to prevent packet loss. A simple FIFO queue is adequate for this purpose for routers which provide a flat
grade of service to all packets.

More complex schemes are required in routers which provide Traffic Management. In a converged internetwork,
different end user applications require different grades of service in order to run effectively. Email can be carried
on a best effort service where no guarantees are made regarding rate of or delay in delivery. Real-time voice data
has a much more demanding requirement for reserved transmission bandwidth and guaranteed minimal delay
in delivery. This cannot be achieved if all traffic is buffered in the same FIFO queue. A queue per so-called Class
of Service is required so that traffic routed through higher priority queues can bypass that in lower priority queues.
Certain queues may also be assured a guaranteed portion of the available output line bandwidth. The ClearSpeed
view of Traffic Handling in context is described in the ClearSpeed Traffic Management system whitepaper [1].

2.2 The problem and prior art 

At first sight the traffic handling task appears to be straightforward. Packets are placed in queues according to
their required class of service. For every forwarding treatment that a system provides, a queue must be
implemented. These queues are then managed by the following mechanisms:

• Queue management assigns buffer space to queues and prevents overflow.

• Measures are implemented to cause traffic sources to slow their transmission rates if queues become
backlogged.

• Scheduling controls the dequeueing process by dividing the available output line bandwidth between the 
queues. 

Different service levels can be provided by weighting the amount of bandwidth and buffer space allocated to
different queues, and by prioritised packet dropping in times of congestion. Weighted Fair Queueing (WFQ), Deficit
Round Robin (DRR) scheduling and Weighted Random Early Detection (WRED) are just a few of the many algorithms
which might be employed to perform these scheduling and congestion avoidance tasks. See reference [2] for a
thorough description of these algorithms.

In reality, system realisation is confounded by some difficult implementation issues:

• High line speeds can cause large packet backlogs to rapidly develop during brief congestion events. Large 
memories of the order 500 MBytes to 1 GBytes are required for 40 Gbits/s line rates. 

• The packet arrival rate can be very high due to overspeed in the packet delivery from the switch fabric. 
This demands high data read and write bandwidth into memory. More importantly, high address bandwidth 
is also required. 

• The processing overhead of some scheduling and congestion avoidance algorithms is high.

• Priority queue ordering for some (FQ) scheduling algorithms is a non-trivial problem at high speeds.

• A considerable volume of state must be maintained in support of scheduling and congestion avoidance
algorithms, to which low latency access is required. The volume of state increases with the number of
queues implemented.

• As new standards and algorithms emerge, the specification is a moving target. To find a flexible (ideally
programmable) solution is therefore a high priority.

In a conventional approach to traffic scheduling, one might typically place packets directly in an appropriate
queue on arrival, and then subsequently dequeue packets from those queues [...]. A scheduler determines the
order of dequeueing. Since the scheduling decision can be [...] as the number of queues
increases, queues are often arranged into small groups which are locally scheduled into
an intermediate output queue. This output queue is then the input queue to a following scheduling stage. The
scheduling problem is thus simplified using a 'divide-and-conquer' approach whereby [...] is
achieved through parallelism between groups of queues in a tree type structure, or so-called hierarchical link
sharing scheme [2].

This approach works in hardware up to a point. For the exceptionally large numbers of queues (of the order
64k) required for per-flow traffic handling, the first stage becomes unmanageably wide, to the point that it becomes
impractical to implement the required number of schedulers.

Alternatively, in systems which aggregate all traffic into a small number of queues, parallelism between hardware
schedulers cannot be exploited. It then becomes extremely difficult to implement a single scheduler - even in
optimised hardware - that can meet the required performance point.

With other congestion avoidance and queue management tasks to perform in addition to scheduling, it is apparent
that a new approach to traffic handling is required.

2.3 Summary of the invention 

A traffic handler architecture in which packets are processed by software and inserted into an orderlist for
scheduling.



2.4 List of attached figures

Figure 2.1 Traffic Handler system functional overview showing principal components

Figure 2.2 Functional overview of the system of MTAPs and other ASIC cores in the Q-chip

2.5 Detailed description

Description of the concept and invention

• There are no separate, physical stage 1 input queues.

• Packets are effectively sorted directly into the output queue on arrival. A group of input queues thus exists in
the sense of being interleaved together within the single output queue.

• These interleaved 'input queues' are represented by state in the queue state engine. This state may track
queue occupancy, finish time/number of the last packet in the queue etc. Occupancy can be used to
determine whether or not a newly arrived packet should be placed in the output queue, or whether it should be
dropped (congestion management). Finish numbers are used to preserve the order of the 'input queues'
within the output queue and determine an appropriate position in the output queue for newly arrived
packets (scheduling).

• Scheduling and congestion avoidance decisions are thus made "on the fly" prior to enqueueing - a
technique referred to within ClearSpeed as 'Think First Queue Later'™.

• This technique is made possible by the deployment of a high performance data flow processor which can 
perform the required functions at wire speed. The ClearSpeed MTAP processor is ideal for this purpose, 
providing a large number of processing cycles per packet for packets arriving at rates as high as one every 
couple of system clock cycles. 
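
The decision flow described in the bullets above can be sketched in a few lines of Python. This is only an illustrative sketch, not the MTAP implementation: the class names, the `weight` and `max_occupancy` parameters, and the use of a self-clocked fair queueing rule to compute finish numbers are all assumptions made for the example. The point it shows is that each 'input queue' exists only as state (occupancy and last finish number), and that the drop decision and the position in the single output orderlist are both decided before the packet is enqueued.

```python
import heapq
from dataclasses import dataclass


@dataclass
class FlowState:
    """State standing in for one interleaved 'input queue' (hypothetical fields)."""
    weight: float             # relative share of output bandwidth (assumed parameter)
    max_occupancy: int        # drop threshold in bytes (assumed parameter)
    occupancy: int = 0        # bytes of this flow currently in the output queue
    last_finish: float = 0.0  # finish number of the last accepted packet


class OrderList:
    """Single output queue ordered by finish number."""

    def __init__(self, flows):
        self.flows = flows
        self.heap = []            # entries: (finish, arrival sequence, flow id, length)
        self.virtual_time = 0.0   # finish number of the last departed packet
        self.seq = 0              # tie-breaker that keeps arrival order stable

    def arrive(self, flow_id, length):
        """Think first: decide drop/accept and position, then enqueue."""
        flow = self.flows[flow_id]
        if flow.occupancy + length > flow.max_occupancy:
            return False          # congestion management: packet is dropped
        start = max(self.virtual_time, flow.last_finish)
        finish = start + length / flow.weight
        flow.last_finish = finish
        flow.occupancy += length
        heapq.heappush(self.heap, (finish, self.seq, flow_id, length))
        self.seq += 1
        return True

    def depart(self):
        """Dequeue the packet with the smallest finish number, if any."""
        if not self.heap:
            return None
        finish, _, flow_id, length = heapq.heappop(self.heap)
        self.flows[flow_id].occupancy -= length
        self.virtual_time = finish
        return flow_id, length
```

With a "voice" flow of weight 3 and an "email" flow of weight 1, a 300 byte voice packet that arrives after a 300 byte email packet is still placed ahead of it in the output queue, because its finish number is smaller.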

Details of the embodiment 

Figure 1 shows the MTAP processing system in relation to other components in the wider Traffic handling system. 

The packet buffering system and orderlist management system are described in detail in sister patents as each is 
an innovative solution to a more specific problem. 

Figure 2 shows a functional decomposition of the MTAP processing system. 

This architecture is described in detail in Application Note 1 of the Per-Flow Traffic Handler design document [3].
This device is referred to as the Q-chip.

Figure 3 shows a full traffic handler implementation using the Q-chip architecture. 

Q-Chip 2 is used to implement the orderlist management system. The Memory and Streaming hubs implement the 
Packet Buffering System. 

Additional related design work 

There are some additional points to note in our use of MTAP processors to perform Traffic Handling functions. It is
not yet clear whether they form claims in this invention, or whether they are patentable ideas in their own right.

Class of service tables: CoS parameters are used in scheduling and congestion avoidance calculations. They are 
conventionally read by processors as a fixed group of values from a class of service table in a shared memory. 
This places further demands on system bus and memory access bandwidth. The table size also limits the number 
of different classes of service which may be stored. 

An intrinsic capability of the ClearSpeed MTAP processor is rapid, parallel local memory access. This can be used
to advantage as follows:

• The Class of Service table is mapped into each PE's memory. This means that no passive state
requires lookup from external memory. The enormous internal memory addressing bandwidth of the SIMD
processor is utilised.

• By performing multiple lookups into local memories in a massively parallel fashion instead of single large
lookups from a shared external table, there is a huge number of different Class of Service combinations
available from a relatively small volume of memory.

• Table sharing between PEs - PEs can perform proxy lookups on behalf of each other. A single CoS table
can therefore be split across two PEs, thus halving the memory requirement.
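
Two of the points above can be illustrated with a small Python sketch. It is not MTAP code: the `PE` class, the table contents and the way a class identifier is split into two indexes are all hypothetical. The first part shows how two small local tables yield many CoS combinations; the second shows one table split across two PEs with a proxy lookup.

```python
# Two small local tables (hypothetical values) give 4 x 4 = 16 combinations
# from only 8 stored entries.
WEIGHTS = [1, 2, 4, 8]
DROP_THRESHOLDS = [16, 32, 64, 128]


def cos_params(cos_id, weights=WEIGHTS, thresholds=DROP_THRESHOLDS):
    """Compose CoS parameters from two independent small lookups."""
    return weights[cos_id % len(weights)], thresholds[cos_id // len(weights)]


class PE:
    """Hypothetical processing element holding one slice of a CoS table."""

    def __init__(self, base, entries):
        self.base = base              # index of the first entry held locally
        self.local = list(entries)    # slice of the table in local memory
        self.neighbour = None         # PE holding the other slice

    def owns(self, index):
        return self.base <= index < self.base + len(self.local)

    def lookup(self, index):
        if self.owns(index):
            return self.local[index - self.base]
        # Proxy lookup: the neighbouring PE reads its local memory on our behalf.
        if self.neighbour is not None and self.neighbour.owns(index):
            return self.neighbour.local[index - self.neighbour.base]
        raise IndexError(index)


# One 8-entry table split across two PEs, each storing only half of it.
table = list(range(100, 108))
pe0, pe1 = PE(0, table[:4]), PE(4, table[4:])
pe0.neighbour, pe1.neighbour = pe1, pe0
```

Here `pe0.lookup(6)` is served from the memory of `pe1`, so each PE stores four entries rather than eight.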

In the context of the ordinary use of MTAP processors this is not necessarily innovative. As a tool being used to
improve function and performance in a Traffic Handling system, however, this gives a real advantage over the shared
memory approach.

Reference material

[1] A. Spencer "Traffic management Whitepaper"

- Background information on Traffic Management

[2] S. Keshav "An Engineering Approach to Computer Networking", Addison-Wesley, 1997

- Scheduling, congestion avoidance and hierarchical link sharing theory

[3] A. Spencer "Per-Flow Traffic Handler"

- Original design work for ClearSpeed Traffic Handling solution



2.6 Key features of the invention

• Traditional packet scheduling involves parallel enqueueing and then serialised scheduling from those
queues. For high performance traffic handling we have turned this around. Arriving packets are first
processed in parallel and subsequently enqueued in a serial orderlist. This is referred to as 'Think First Queue
Later'™.

• The deployment of a single pipeline parallel processing architecture (the ClearSpeed MTAP processor) is
innovative in a Traffic Handling application. It provides the wire speed processing capability which is
essential for the implementation of this concept.

• An alternate form of parallelism (to independent parallel schedulers) is thus exploited in order to solve the
processing issues in high speed Traffic Handling.



2.7 Scope of claim

The claim applies most specifically to the use of MTAP processors in a Traffic Handling device used for network
traffic management. The claim could be broadened beyond the specific use of MTAPs to cover the more general
TF-QL concept and the implementation of the orderlist.
3. A packet storage system for traffic handling

3.1 Background 

A basic understanding of router anatomy and traffic handling is assumed. [...]
queues.

3.2 The problem and prior art 

I am unaware of what prior art exists that might [...]

The following issues in combination make packet buffering particularly difficult at 40 Gbits/s line rates.

1. High data bandwidth is required to accommodate the simultaneous reading and writing of packets (at worst
case fabric overspeed).

2. High address bandwidth is required to cope with the worst case [...] packets
are simultaneously being written to and read from the memory in a random access mode.

3. Memory capacity must be high as buffers will fill up rapidly during transient bursts at high line rates.

4. The manipulation of state which is associated with either logical queue management or memory
management must be minimised at high line rates. The number of system clock cycles typically available to the
hardware or software device which performs such a function will be minimal.

[...] interfaces.

In summary, it is very difficult to design architectures that meet all four criteria. 

3.3 Summary of the invention 

A Memory Hub, used for buffering packets in a high line rate Traffic Handler. 
3.4 List of attached figures

Figure 3.1 Functional overview of the components of the packet storage system

Figure 3.2 Architectural overview of the Memory Hub

Figure 3.3 Datagram Retrieval Unit design

Figure 3.4 Implementation of a packet storage system for traffic handling. (a) Multi-chip, scalable implementation
approach. (b) Single chip, highly integrated solution.

3.5 Detailed description 

This proposal describes an invention in which component behaviours, ideas and devices are assembled into a
solution which meets all the required criteria for 40 Gbits/s packet buffering in a traffic handling system. Although
certain peripheral behaviours form part of the overall solution, the Memory Hub is the primary embodiment of the
invention.

Description of the concept and invention

Problem decoupling - First, isolate the problem so that it is not entangled and interdependent with other
functions.

• Relatively complex functions may be used to control the enqueuing and dequeueing of packets. The
complexity of such packet handling/processing can be alleviated somewhat if the packets themselves are
not passed around the system. Since access to packet content is not required in [...], packets
can be placed in memory and be represented by small, fixed size packet records. It is these records
which are processed and manipulated in logical queues or data structures. The packet records, scheduled
in order, can subsequently be used to recover packets for forwarding on the output line.

• The processing and logical queue management functions are thus decoupled from the task of packet
buffering and memory management.

• Subsequent QoS processing is performed on a small, fixed sized record of packet metadata. This record
typically comprises the packet location in memory (address of first block), the identity of the stream to
which the packet belongs (appended to the packet upstream), the packet length and control flags.
Additional data is looked up locally using the stream identifier.

Memory management - Memory is generally statically or dynamically assigned when it is used as storage for data
structures. Next, define an efficient scheme for assigning memory to packets.

• Packet memory is divided into small blocks in the memory address space. The blocks may be one of n
different configured sizes. For reduced system complexity, n=2 is considered suitable. There is no static
assignment of memory to queues. Instead, packets are stored into one or more blocks of a given size (as
appropriate). Each block of a given packet points to the next in linked list fashion.

• (It is emphasised here that) packets do not point to one another. In other words, there is no logical queue
management.

• The availability of all memory blocks is recorded by a memory manager in the memory hub. To do this the
memory manager employs a bitmap - each bit representing a block.

• By reading words from this bitmap, the memory manager can identify the addresses of free blocks in batches.
Bits are converted into addresses, and the addresses held in a central pool of limited but adequate size. This is
a form of data decompression which is more storage efficient than maintaining a memory freelist (i.e.
storing all possible addresses in a queue or linked list).

• The central pool can be topped up by either scanning the bitmap, or more directly from the stream of 
addresses which arrives as packets are read from memory and the memory blocks they occupy are 
released. If the central pool is full, returned addresses must be buffered and their information inserted into 
the bitmap. 
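
The bitmap and central pool scheme described above can be sketched as follows. This is a minimal software model under stated assumptions, not the Memory Hub design: the class name, the 32 bit word width and the convention that a set bit means "free and not currently held in the pool" are all choices made for the example.

```python
class MemoryManager:
    """Toy model of a bitmap backed allocator with a small central pool."""

    WORD_BITS = 32   # assumed bitmap word width

    def __init__(self, num_blocks, pool_size):
        assert num_blocks % self.WORD_BITS == 0
        all_free = (1 << self.WORD_BITS) - 1
        # Assumed convention: bit set = block free and not held in the pool.
        self.bitmap = [all_free] * (num_blocks // self.WORD_BITS)
        self.pool = []               # central pool of ready-to-use addresses
        self.pool_size = pool_size   # "limited but adequate" capacity

    def top_up(self):
        """Scan bitmap words, converting set bits into block addresses."""
        for w, word in enumerate(self.bitmap):
            bit = 0
            while word >> bit and len(self.pool) < self.pool_size:
                if (word >> bit) & 1:
                    self.bitmap[w] &= ~(1 << bit)   # address now lives in the pool
                    self.pool.append(w * self.WORD_BITS + bit)
                bit += 1
            if len(self.pool) >= self.pool_size:
                break

    def allocate(self):
        """Hand out a free block address, or None if memory is exhausted."""
        if not self.pool:
            self.top_up()
        return self.pool.pop() if self.pool else None

    def release(self, addr):
        """Recycle a block: straight into the pool, or back into the bitmap."""
        if len(self.pool) < self.pool_size:
            self.pool.append(addr)
        else:
            self.bitmap[addr // self.WORD_BITS] |= 1 << (addr % self.WORD_BITS)
```

In this model a released address goes straight back into the pool when there is room, which corresponds to topping up from the stream of returned addresses, and falls back to setting its bit in the bitmap when the pool is full.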

Efficient packet storage - Placing information into memory will normally require the overhead of updating state 
which records the presence of the new information. A means of smoothly streaming data into memory at 80G is 
required which does not get held up by intermediate state manipulation. 

• The device that receives packets from the switch fabric is referred to as the Arrivals block. This pipelined
processor extracts information from the packet which is used to create the packet record, and slices the
packet up into chunks which can be mapped into memory blocks. An appropriate fragment size (and thus
memory block size) is selected for each packet based on its length. The packet is forwarded to the Memory
Hub and the packet record to the system used for QoS processing and logical queueing.

• Arrivals maintains its own local pool of available memory blocks by periodically reading batches of
addresses from the central pool. A separate local (and central) pool is required for each different memory
block size implemented.

• This local pool enables Arrivals to load (fragmented) packets immediately into free blocks of memory. The
only minor complexity in doing this is to insert the address of the last memory block of the same packet into
the current block. It is a simple, fast system which requires no state manipulation other than to pop items
from the local pool.

• An important feature of this system is that it supports load balancing across the available memory
channels of the Memory Hub. When the local pool is replenished, it receives the addresses of memory blocks
which are mapped into different physical memory channels in equal quantities. Consecutive blocks can
thus be written to memory channels in round robin fashion, efficiently spreading the address and data
bandwidth load.

• In the unusual event that the local pool becomes empty, packets may need to be partially or fully dropped
by Arrivals. These events are still reported through packet record creation so that event handling may be
managed centrally by the QoS processor. The processor must purge the packet from memory, and must
report the details of any dropped packet. These are both functions the QoS processor already possesses
for normal operation.
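
The storage path described above can be sketched as follows. This is an illustrative model only: the block sizes, the record field names and the dictionary standing in for the memory channels are hypothetical. It assumes that each stored block carries a pointer to the next block of the same packet (following the linked list description given earlier), and for brevity it models only the fully dropped case, not partial drops.

```python
from collections import deque

BLOCK_SIZES = (64, 512)   # hypothetical small and large block sizes (n = 2)


def store_packet(packet, stream_id, pools, memory):
    """Slice a packet into blocks and return its fixed size packet record."""
    size = BLOCK_SIZES[0] if len(packet) <= BLOCK_SIZES[0] else BLOCK_SIZES[1]
    chunks = [packet[i:i + size] for i in range(0, len(packet), size)] or [b""]
    record = {"addr": None, "stream": stream_id, "length": len(packet),
              "block_size": size, "dropped": False}
    pool = pools[size]                 # one local pool per block size
    if len(pool) < len(chunks):
        record["dropped"] = True       # the drop is still reported via the record
        return record
    # The only state manipulation is popping addresses from the local pool.
    addrs = [pool.popleft() for _ in chunks]
    for i, chunk in enumerate(chunks):
        next_addr = addrs[i + 1] if i + 1 < len(addrs) else None
        memory[(size, addrs[i])] = (next_addr, chunk)
    record["addr"] = addrs[0]
    return record


def recover(record, memory):
    """Follow the block chain from the record to rebuild the packet."""
    data, addr = b"", record["addr"]
    while addr is not None:
        addr, chunk = memory[(record["block_size"], addr)]
        data += chunk
    return data
```

If the local pool is filled with addresses interleaved across memory channels (for example `deque(range(8))` where channel = address modulo the channel count, an assumed mapping), consecutive blocks of a packet land on consecutive channels, which gives the round robin load spreading described above.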
Efficient packet recovery - The packet storage function is deliberately simplified as its 80G performance
requirement [...]. The 'open loop' method used by Arrivals makes the 40G packet recovery function [...]

[...] data (packet) from memory (the memory hub), and forwards the data to the line, [...]
block is still being read from memory.

• [...] to a previous request has returned).

• The recovery of multi-block packets can be made efficient by implementing a hierarchical packet retrieval
[...] and end of valid data within the block.

• [...] single blocks. Packets which are stored in multiple blocks must be read from memory so that the 'next
block' pointers can be recovered.

[...] system - The partitioning of a system must be considerate of real world issues such as
[...] power dissipation, silicon process technology, PCB population and routing etc.

• Multiple, independently addressable memory channels are accessed via a Memory Hub. The Hub isolates
[...] maximises the number of pins available for memory channel implementation, and (b) multiple hubs may be
connected to a single QoS processing chip (see next point).

• [...] pincount and reducing the packaging cost of a potentially complex ASIC.

Details of the embodiment

Refer to the Per-Flow Traffic Handler design document [1] for further details on the microarchitectural design of
a Memory Hub.

Figure 1 illustrates how Arrivals and Dispatch provide the fork and join points in the packet stream. They isolate
the memory hub, which simply distributes packet chunks received from the high speed link to memory, or retrieves
packets from memory on request.

Figure 2 shows the content and interconnection of the Memory Hub in more detail. [...]
between the DRUs and the memory channel controllers. This means that packets [...]
on multiple memory channels and fetched/reconstructed by a single DRU. Multiple DRUs exist to increase
bandwidth and provide load balancing for single line output (OC-768) systems. Alternatively, in multi-line systems
(e.g. 4 x 10G Ethernet) a single DRU is assigned per line.

The memory manager contains the central pool and the memory bitmap. The memory bitmap would be implemented
in embedded SRAM and would typically be 250k bytes in size in order to support a 1G packet memory
organised as 512 byte memory blocks.
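
The bitmap size quoted above can be checked with a short calculation. The sketch below assumes that "1G" means 2^30 bytes of packet memory. The freelist comparison is an illustration added here, not a figure from the design document: it assumes a freelist that stores one minimum width address per block.

```python
def bitmap_sizing(memory_bytes, block_bytes):
    """Return (blocks, bitmap size in bytes, naive freelist size in bytes)."""
    blocks = memory_bytes // block_bytes
    bitmap_bytes = blocks // 8               # one bit per block
    addr_bits = (blocks - 1).bit_length()    # bits needed to name any block
    freelist_bytes = blocks * addr_bits // 8
    return blocks, bitmap_bytes, freelist_bytes


print(bitmap_sizing(2**30, 512))   # (2097152, 262144, 5505024)
```

Under these assumptions the bitmap needs 262,144 bytes (256 KBytes, in line with the 250k bytes quoted), whereas a freelist holding every 21 bit block address would need roughly 5.5 MBytes.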

Figure 3 shows the content and interconnection of the DRU block. The controller supervises read pipelining or the 
process of packet recovery from a linked list. 

Figure 4 shows two possible system implementations. 8 channels of memory provide approximately 20 GBytes/s
data bandwidth, and 300 M random accesses per second. This meets the requirement for 40G traffic handling (2x
switch fabric overspeed). While it is conceivable that the whole system could be implemented in a single device
(b), it may be more practical to distribute the function over multiple devices (a). There are many good reasons for
doing this, including:

• The physical distribution and second-level interconnect of a large number of memory devices on a PCB is
alleviated if all devices are not clustered around a single highly integrated processor.

• It may be difficult to observe tight specifications on electrical characteristics, signal line separation and line
termination if memory channels are clustered densely around the perimeter of a single chip.

• Power dissipation is more evenly spread. The combination of high power and pincount in a single device 
can require expensive packaging technology. 

The multi chip approach is also versatile in that the number of hubs used can be scaled to meet the memory
requirement of the application (e.g. scaling from 10G to 40G, and beyond).



Additional related design work 

Reference [1] also describes in detail the Arrivals and Dispatch blocks which could be instanced in a Streaming 
Hub. Whether these blocks contain intellectual property which should be protected is unclear. 

The Streaming Hub: - is an implementation approach which could be used to localise noisy, power hungry high- 
speed serial interfaces in a single device with minimal additional logic. Dispatch blocks must be co-located on a 
single chip such as this with connections to all memory hubs so that a crossbar switch between multiple hubs and 
multiple output lines can be provided. 

The Arrivals block: - implements pipelined processing techniques in order to perform rapid, on-the-fly record
extraction. A record is forwarded for every packet. Flags in the record are set to indicate whether a packet is fully
buffered, partially buffered, or dropped in Arrivals (dependent on the status of the packet buffer). This enables the
processor to perform appropriate housekeeping and error reporting.

The Dispatch block: - is a DMA engine which reads a stream of records, retrieves the corresponding packets from 
the Memory Hub and then forwards them to the output line. Dispatch uses its output buffer occupancy as a servo 
signal to the memory hub to control the rate at which the Hub delivers packet data. 



Reference material 

[1] A. Spencer "Per-Flow Traffic Handler"

- Original design work for ClearSpeed Traffic Handling solution 



3.6 Key features of the invention 



Traffic Handler patent summary

Company Confidential    Confidentiality Level: RED    ClearSpeed Technology Ltd 2001



• Queue management and memory management are fully decoupled (manifested by the packet/packet
record concept). This addresses criterion (4).

• [...] limits - addresses (1) and (2).

3.7 Scope of the claim 

Traffic handling is characterised by high line rates and high speed random access into a large memory.

More broadly, the approach could be applied wherever (latency tolerant) high speed random access is required
into a very large memory.
4. The state element

4.1 Background 

Situations often arise whereby a function must be performed on a continuous stream of data. If the function is
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to
execute. As the rate of packet arrival increases there will come a point at which a single processor can no longer keep
up. The function must then either be distributed across multiple processors arranged in a pipeline, or across
multiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin
sequence. Packets output from parallel processors are typically reordered before forwarding.

This is fine, as long as there is no interdependence between the processors. They operate independently of one
another, perhaps sharing a common code or data store into which all have read only access.

4.2 The problem and prior art 

A problem arises when such processors share state variables for which both read and write access is required.
Processors cannot be permitted to simultaneously read/modify/writeback a shared variable as the result from the
first writeback will be overwritten by the second. It is necessary to serialise the accesses. This raises two
significant issues:

1. A system for interlocking processors together must be implemented so that they may arbitrate for a
resource and then lock it when there is contention. This control signalling can be complex and add
significant functional and performance overhead.

2. When a processor has successfully negotiated for a resource, it should use that resource and then release
it as soon as possible to limit the delay imposed on other processors. If access latencies are long to
external memories, this can impact heavily on system performance.

Semaphores can be used to interlock processors, or control logic and caches can be used to intercept concurrent
accesses and serialise them; however, these can be complex, slow and/or require significant support tied into
hardware. Embedded memory can reduce lock out time, but delays can still be significant.
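
The overwrite hazard described at the start of this section can be made concrete with a deterministic sketch. The function names and values are illustrative only. The two functions replay the same pair of updates to one shared variable, first with overlapping read/modify/writeback sequences and then with the accesses serialised.

```python
def unserialised(initial, delta_a, delta_b):
    """Both processors read before either writes back: one update is lost."""
    memory = initial
    a = memory               # processor A reads
    b = memory               # processor B reads the same, soon stale, value
    memory = a + delta_a     # A writes back
    memory = b + delta_b     # B writes back, overwriting A's result
    return memory


def serialised(initial, delta_a, delta_b):
    """Each read/modify/writeback completes before the next one starts."""
    memory = initial
    memory = memory + delta_a
    memory = memory + delta_b
    return memory


print(unserialised(100, 40, 60))   # 160: A's update of +40 has been lost
print(serialised(100, 40, 60))     # 200
```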

4.3 Summary of the invention 

A smart memory cell for serialising accesses to shared state variables. 

4.4 List of attached figures

Figure 4.1 Illustrating the advantage of the state element concept. (a) Conventional approach. (b) Using state
elements.

Figure 4.2 Functional overview of the state element

Figure 4.3 Implementation overview of the state element

4.5 Detailed description

State elements are the key components which perform the serialisation of accesses into [...] This
patent describes only the state element. In a real system state elements must be combined
[...] bus. The innovative arrangement of state elements into larger state engines (which can be
connected to a system bus) is covered by a sister patent - "The State Engine".

Description of the concept and invention

Instead of getting parallel processors to read from memory, modify, writeback data, get them to request that the
[...] on its behalf. In other words, position the [...]
each processor, but in a simple shared processor which has local and rapid access to the memory in which the
shared state variables are stored.

The state engine concept and the advantages it brings are illustrated in figure 1.



The state element is analogous to an object in OOD. It has privately stored data which is accessible only via the
object's methods. By issuing commands, parallel processors could be considered to be making function calls to
the object.

A state element comprises a small block of embedded memory with single cycle read/write access time combined
with a simple arithmetic and logic unit. The Arithmetic Unit receives commands (from processors) which comprise
an address, data and a command code. The address identifies the state variable which is to be accessed, the data
provides operands which the arithmetic unit uses to modify the variable, and the command selects a locally stored
thread of programmed microcode which is able to read, modify and writeback the state variable within a very small
number of system clock cycles. The result can be returned to the processor which issued each command.

These components are shown in figure 2. 
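
The command format described above (address, data, command code) can be modelled in software as follows. This is a behavioural sketch only, not the microarchitecture: the class name, the four command codes and the use of Python functions in place of microcode threads are all assumptions made for illustration. Each command performs its whole read/modify/writeback inside the element, so callers never observe or depend on an intermediate value.

```python
class StateElement:
    """Toy model: private memory plus a table of read/modify/writeback routines."""

    def __init__(self, size):
        self._memory = [0] * size   # private state, reachable only via commands
        # Each hypothetical 'microcode thread' maps (old value, operand)
        # to (new value, result returned to the requesting processor).
        self._microcode = {
            "READ": lambda old, data: (old, old),
            "ADD": lambda old, data: (old + data, old + data),
            "SWAP": lambda old, data: (data, old),
            # data = (amount, limit): add only if the limit is not exceeded.
            "ADD_IF_BELOW": lambda old, data: (
                (old + data[0], True) if old + data[0] <= data[1] else (old, False)
            ),
        }

    def command(self, code, address, data=0):
        """Execute one command atomically and return its result."""
        old = self._memory[address]                   # read
        new, result = self._microcode[code](old, data)  # modify
        self._memory[address] = new                   # writeback
        return result
```

For example, a processor tracking queue occupancy could issue `ADD_IF_BELOW` with a packet length and a threshold, and receive a single accept or reject result rather than reading, testing and writing the occupancy itself.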
Details of the embodiment 

Refer to the State Engine design frame work document [1] for full details of the microarchitectural design and im- 
plementation of a State Element. 

A smart memory element comprises an embedded memory and an attached function - the function could either 
be hardwired (a finite state machine), or a programmable, mlcrocoded circuit. The latter approach is the more ver- 
satile and complex, and receives further attention in this document. 

A more complete picture of the system of component modules and their interconnection is shown in figure 3. Note
the presence of special function and condition blocks. These greatly extend the functional capability of the element
(as described in ref [1]).

The emphasis in state element design is on rapid memory access speed, not processing capability. Embedded
memory blocks are small enough that single cycle access time is achievable. Configurable R/M/W is possible
within a two cycle period, as it is possible to perform a simple arithmetic operation on the result of a read and
have it turned around for writeback within the second cycle. Typically, a command could be fully processed within
3 to 5 clock cycles. Figure 4 illustrates the simplicity of the arithmetic unit, and how the path between the
command line and the memory has minimal delay. The lower diagram shows a more complex variant in which
multiple items of state are held in memory. The impact on the command line turnaround (and microcode store size)
is significant. (However, this is not to say that the lower circuit should not be used. In a lower performance system
with a more complex set of state it could be the preferred approach.)

Additional related design work 

Reference [1] also covers some algorithmic techniques used in conjunction with state elements. 

System threads: - Background system threads could be programmed to operate on the data in the state memory
when commands from processors are not being serviced. For instance, this could be useful for identifying state
entries which are idle.

Find free queue algorithm: - Find_free_queue system function. This is a background thread which implements a
"Two strikes and out" algorithm for de-assigning state entries used to represent/manage queues which go idle (i.e.
empty).

Special function units: - The 'flag unit' and 'address unit' are special function units designed to support the find free
queue algorithm. The features they provide are considered to be of generic value and could be used by other
algorithms (such as that required for maintaining meters in state elements).

Scheduling algorithm: - The information required by the Self-Clocked Fair Queueing algorithm cannot be mapped 
directly into the state element. It is represented in a form which makes access and manipulation more robust and 
efficient. Is this a claim relevant to the state element itself or the software using the state element? (see ref [2]). 
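
The "Two strikes and out" background thread named above is not specified in this summary, so the following is only one plausible reading, offered as a sketch: an assigned queue found empty on one scan receives a strike, and if it is still empty on the next scan it is de-assigned. The names `QueueEntry`, `find_free_queue_pass` and the `strike` flag are hypothetical and not taken from reference [1].

```python
class QueueEntry:
    """Hypothetical per-queue state entry."""

    def __init__(self) -> None:
        self.assigned = False
        self.length = 0       # number of packets currently queued
        self.strike = False   # set when the queue was idle on the last scan


def find_free_queue_pass(entries, free_pool):
    """One background scan over the state memory (assumed behaviour)."""
    for qid, entry in enumerate(entries):
        if not entry.assigned:
            continue
        if entry.length > 0:
            entry.strike = False          # activity seen: clear the strike
        elif not entry.strike:
            entry.strike = True           # first strike: idle on this scan
        else:
            entry.assigned = False        # second strike: de-assign
            entry.strike = False
            free_pool.append(qid)
```

Under this reading a queue must be observed idle on two consecutive scans before it is released, which avoids releasing a queue that is only momentarily empty.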
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Reference material 

[1] A. Spencer "State Engines - a design framework"

- Original design work for ClearSpeed State Engine solution
[2] A. Spencer "Per-Flow Traffic Handler"

- Original design work for ClearSpeed Traffic Handling solution

4.6 Key features of the invention 

• Intelligent memory - The state element localises the serialisation of parallel data accesses at the memory
end, not the processor end. This greatly reduces the latency commonly associated with the blocking of
state.

• Functional versatility - The state element provides a number of (configured/)programmed remote functions
which may be performed on the stored data. These functions would be small in number and include data
read, write, arithmetic operations and conditional accesses.

• Flexibility - The functions can (but need not) be expressed in microcode so that the state element remains
programmable and does not 'tie' software executing on the processor to functions hardwired into the state
element.

• System efficiency - The read/writeback occurs between the ALU and the memory inside the state element.
Only the command travels across the system bus. This reduces the burden on the system bus as
compared with conventional approaches.

• System simplicity - The read/modify/write is encapsulated within the state element and serialisation is
inherently enforced by the state element logic. Processors can simultaneously issue commands which will
cause a function to act on the same item of state without having to first negotiate with one another.



4.7 Scope of the claim 

The problem was identified while using MTAP processors to access shared state in a Traffic Handling application, 
and state elements were conceived to resolve the contention issue. 

It is recognised that contention is not an issue exclusive to Traffic Handling, therefore state elements could be used 
as a general purpose tool in support of MTAP processors in any application. 

Most broadly, contention can arise when any two processors in a realtime (data flow processing) system require
R/M/W access to a shared state variable. State Elements could therefore be used in conjunction with any parallel
or pipelined arrangement of processors.
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5. The state engine 
5.1 Background 

Situations often arise whereby functions must be performed on a continuous stream of data. If the functions are
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the 
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to exe- 
cute. As the rate of packet arrival increases there will come a point at which a single processor can no longer keep 
up. The function must then either be distributed across multiple processors arranged in a pipeline, or across mul- 
tiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin se- 
quence. Packets output from parallel processors are typically reordered before forwarding. 

This is a well proven approach to high performance packet processing, but is limited in its scalability as the number
of processors increases. Access to shared memories, be it for code or data, eventually becomes a bottleneck.
Simultaneous R/W access to shared state will further add to the complexity of system control signalling in order to
resolve contention.

MTAP processors resolve traditional issues relating to instruction lookup, and State Element technology supports 
parallel processing systems by localising and managing serialisation to shared state. (Both technologies are pro- 
vided by ClearSpeed Technology Ltd). This leaves the issue of high speed access to multiple items of shared state 
information by multiple parallel processors. As the number of processors and the complexity of their algorithms 
increases, address and data bandwidth requirement over the system bus to the shared data will also increase. 
This can then become a bottleneck.

A good case in point is the challenge of Traffic Management in network routers. A significant, recognised issue in
per-flow Traffic Handling is that a number of items of state need to be maintained for each of a large number of
queues. The implications of this are that (a) a considerable volume of shared memory needs to be implemented,
(b) a lot of memory address bandwidth is required if each queue requires that separate accesses be made to
different (shared) state variables, and (c) the memory access latency is likely to be long, thus causing state blocking
during modification to impact on performance.



5.2 The problem and prior art 

Contention for shared state variables can be resolved by implementing state elements as described in reference
[1] and in the 'State Element' patent proposal. However, the successful implementation of the state element
concept in high performance systems requires additional innovation to overcome the following:

1. Arranging processors in parallel can create a high rate of access to the same item of state.

2. What if a given function needs access to multiple variables from the same address? In other words,
what if it needs to access and process a state record?

3. What if multiple functions executing in a processor on a single datagram each require access to different,
independently addressable tables of state variables or records?

In short, the fundamental problem being addressed is that of a high rate of state access. This problem must be 
solved in a flexible way which enables the easy scaling of both the quantity of state being stored and the rate of 
state access. 



5.3 Summary of the invention 

A formal framework for designing an active state storage system using state elements 
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5.4 List of attached figures 
Figure 5.1 Conceptual design hierarchy of the state engine

Figure 5.2 Functional view of a state engine

Figure 5.3 Implementation of the state cell

Figure 5.4 The Queue State Engine - an implementation example of a complex state engine design for Traffic
Handling and queue management



5.5 Detailed description 

Description of the concept and invention 

... which may be grouped together to form state engines.

This hierarchical design framework is illustrated in figure 1. The component parts shown are:

State Record - This is a conceptual entity only. It is a group of one or more state variables which share a common
address.

... general purpose data field.

State Element - As described in reference [1] ... The
result may also be recorded in a data field in the command line. The primary role of the state element is to manage
the state access serialisation point by executing a simple function on memory at maximum speed.
State Cell - If there is more than one state variable in a record, it is permissible for the entire record to be stored
as an entry within a single state element. However, as each field in the record would need to be processed in turn,
this would throttle the available bandwidth to the state. In the State Cell each field of the record is stored as a single
state variable in its own State Element. These State Elements are then chained together in a pipeline. The
command line passes from one Element to the next, the same address and control word being used at each stage to
pick a different field from a common record and perform some function on it. State cell logic provides
synchronisation between its constituent Elements, which effectively make up a memory oriented pipelined processing
system.

The primary role of the State Cell is thus to provide a means of constructing simple, pipelined processors which 
enable more complex state records to be handled at high speeds. 
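
The chaining of State Elements into a State Cell can be illustrated in software. The sketch below is an assumption-laden model: `FieldElement`, `StateCell`, the `"enqueue"` control word and the two-field queue record (byte length, packet count) are all hypothetical examples. A sequential loop stands in for the hardware pipeline, in which each stage would be working on a different command line at the same time.

```python
class FieldElement:
    """One State Element holding a single field of every record."""

    def __init__(self, size, ops):
        self.memory = [0] * size
        self.ops = ops  # control word -> function(old_value, line) -> new_value

    def step(self, line):
        op = self.ops.get(line["control"])
        if op is not None:
            addr = line["address"]  # same address is used at every stage
            self.memory[addr] = op(self.memory[addr], line)
        return line


class StateCell:
    """Chain of elements; the command line visits each one in order."""

    def __init__(self, elements):
        self.elements = elements

    def execute(self, line):
        for element in self.elements:
            line = element.step(line)
        return line


def add_bytes(old, line):
    new = old + line["data"]
    line["results"].append(new)   # result recorded in the command line
    return new


def count_packet(old, line):
    new = old + 1
    line["results"].append(new)
    return new


# Hypothetical two-field record: queue length in bytes, then packet count.
cell = StateCell([
    FieldElement(8, {"enqueue": add_bytes}),
    FieldElement(8, {"enqueue": count_packet}),
])
```

A single command line such as `{"address": 3, "control": "enqueue", "data": 64, "results": []}` updates both fields of record 3 as it passes through the cell.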

State Array - The embedded memory used in the State Elements of State Cells must be relatively small in volume
for rapid (ideally single cycle) access. This places a limit on the number of instances of a state record which may
be stored in a single State Cell. To increase the quantity of state, State Cells of a given type can be tiled to form
a large State Array. Scaling during device layout is simplified by the State Array interconnect. The segmentation
of an interconnection framework and the coupling of adjacent Cells in a tiled array using well defined interfaces is
shown in the accompanying figure. The interconnect preserves order between accesses to the same State Cells.
Since order preservation amongst command lines accessing different State Cells is not required, there is no need
for the latency of command line accesses to different Cells across the array to be balanced. The Array is scalable
in a simple way and is layout-friendly.

Increasing the total state storage volume by multiplying State Cells can also increase overall state access
bandwidth, as the throughput of an individual State Cell is likely to be a little lower than that of the interconnect. If the
number of State Cells is increased to the point that the interconnect becomes the limiting factor then aggregate
throughput can be further increased by providing multiple interconnect channels - each channel addressing a
different portion of the array (i.e. table). This is analogous to designing a memory system with multiple, independently
addressable channels to increase random access bandwidth.

The primary role of the state array is to provide scalable capacity. It also provides a means for scaling address 
and data bandwidth. 
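
How a flat record address might be spread over tiled State Cells can be shown with a small sketch. The mapping used here (cell index from the upper part of the address, entry offset from the lower part) is an assumption made for illustration; the source does not state how addresses are decoded, and `StateArray` is a hypothetical name.

```python
class StateArray:
    """Tiled cells presented as one larger table (illustrative only)."""

    def __init__(self, num_cells, entries_per_cell):
        self.entries_per_cell = entries_per_cell
        self.cells = [[0] * entries_per_cell for _ in range(num_cells)]

    def capacity(self):
        return len(self.cells) * self.entries_per_cell

    def locate(self, address):
        # Assumed decode: quotient selects the cell, remainder the entry.
        return divmod(address, self.entries_per_cell)

    def add(self, address, value):
        cell, offset = self.locate(address)
        self.cells[cell][offset] += value
        return self.cells[cell][offset]
```

Capacity grows linearly with the number of cells, and commands that decode to different cells touch independent memories, which is what allows them to proceed without mutual ordering.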

State Engine - The State Engine combines State Arrays with all the additional glue logic and facilities which are
required to construct a block which can be configured and accessed via a system bus. Components include:

• Bus interface logic

• System control logic - The state engine controller may issue (private) system commands to the state
arrays. These commands are invoked by external blocks through accesses to the controller via the utility
bus interface. Only (public) state commands may arrive via the main data flow interfaces. System
commands configure the arrays or extract diagnostic information.

• Bypass logic - Bypass modes enable commands to skip arrays which they are not required to access. This
will conserve power and bandwidth. The required extraction and insertion points can also be used by the
system controller.

• Inter array switch connectors - Novel(?) application of (Banyan) switching ...
between tables. Only required when there is more than one independent route through each State Array.

State Engine behaviours include:

• Message broadcasting - System commands can be broadcast throughout the ...
status or passing configuration and control messages. This method is also used for loading microcode into
state arrays.

• Multiple accesses - If multiple arrays are connected in a pipe then it is evident that each command line
must contain different address and command information for each array. A single command issued from
the processor thus results in multiple state accesses.

• Command line "morphing" - As command lines propagate from array to array they are used and
sometimes updated as a result of each state element access. The data inserted into the command by state
elements in one array could be used by the state elements in the next. Data and perhaps even addresses
could be modified.
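
Command line "morphing" can be illustrated with two tables in a pipe. This is a sketch under stated assumptions: the tables, the field names (`flow`, `queue_id`, `bytes`, `queue_len`) and the function `lookup_then_update` are invented for the example. The first table inserts a value into the command line, and the second table uses that inserted value as its address.

```python
def lookup_then_update(flow_table, queue_table, line):
    """Pass one command line through two state tables in series."""
    # Stage 1: read access; the result is written into the command line.
    line["queue_id"] = flow_table[line["flow"]]

    # Stage 2: the address used here was inserted by stage 1.
    queue_table[line["queue_id"]] += line["bytes"]
    line["queue_len"] = queue_table[line["queue_id"]]
    return line


flow_table = [2, 0, 1]    # hypothetical flow -> queue mapping
queue_table = [0, 0, 0]   # hypothetical per-queue byte counts
```

The processor issues one command and receives one morphed command line back, even though two independent state accesses were made.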



Details of the embodiment 

Example state engine architectures are documented in the "State Engine design framework" document [1].

The design of the state element is detailed in the 'State Element' patent.

The design of the State Cell which stitches Elements together in a pipeline is shown in figure 3. 

The architecture of the State Array, and the interconnection of State Engine components is illustrated by figure 4. 
This shows a three-array instance of a Queue State Engine which supports per-flow Traffic Handling. 

Additional related design work 

Load balancing - It is possible that state records may be allocated dynamically on demand (and also deassigned).
If multiple paths exist through a given array then it is desirable for the stored state to be spread evenly across the
available State Elements/Cells. The availability of state entries in such a system could be advertised by the
Controller in such a way as to ensure that records are assigned from each Element in turn, thus balancing the load.

System threads: -
Reference material 

[1] A. Spencer "State Engines - a design framework" 

- Original design work for ClearSpeed State Engine solution 
[2] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 



5.6 Key features of the Invention 

All of the specified issues associated with high speed data lookup by parallel processors are addressed:

• A formal framework for creating a parallel coprocessor using smart memory (state elements).

• Single access multiple lookups - A single access acts upon multiple, independent state tables within the
state engine, i.e. multiple lookups into different tables held in different memories as a result of a single
request from the bus.

• Pipelined architecture - Lookups into different tables are not fired off from a point source into different
memories. Instead, the access itself (in the form of a command line) is routed from table to table in a serial
fashion. It is an object which travels through the QSE.

• Command line "morphing" - As command lines propagate along the pipe from table to table they are used
and sometimes updated as a result of each table access. The data inserted into the command by one table
could be used by the state elements in the next.

• State cell concept - high throughput pipelined processing (scalable processing power)

• State array concept - 'layout friendly' scheme for scaling quantity of state, bandwidth and load balancing
Traffic Handler patent summary



6. Programmable orderlist manager 



6.1 Background 

A basic understanding of router anatomy and traffic handling is assumed.

In traffic handling, packets may be placed in one of a number of queues. With more than one queue present, a
scheduling function must determine the order in which packets are served from the queues. The scheduled order
is determined principally by the relative priorities that the scheduler places on the queues - not on the order in
which packets arrived at the queues. The scheduling function is thus fairly serial in character.

For example, consider the two popular scheduling methods:

1. Fair Queue scheduling - every packet in the queue is given a finish number which indicates the relative
point in time at which the packet is entitled to be output. The function that dequeues packets must
identify the queue whose next packet has the smallest finish number. Ideally, only after the packet has
been served and the next packet in the same queue been revealed can the dequeueing function make its
next decision.

2. Round Robin scheduling - Queues are inspected in turn in a predetermined sequence. On each visit a
prescribed quota of data may be served.



6.2 The problem and prior art 

The fundamental problem is how to perform such scheduling algorithms at high speeds. A serialised process can ...
processing technology may only be able to provide a couple of system clock cycles per packet.

On top of this, the scheduling and queue management task is further confounded by a requirement ...
queues. Hardware which executes the scheduling function in a serial manner is
... and therefore inflexible if it is to meet the required performance.

6.3 Summary of the invention 

A system for maintaining ordered logical data structures in software at high speeds 
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6.4 List of attached figures 
Figure 6.1 Concept of orderlist management using a bin sort approach

Figure 6.2 Functional overview of an orderlist management system

Figure 6.3 Implementation of orderlist manager using MTAP processors and state engines
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6.5 Detailed description 

Introduction

A third approach to scheduling is to maintain a single, fully ordered queue instead of multiple FIFO queues. In other
words, rather than buffer packets up in a set of parallel input queues and then schedule them in some sequence
into an output queue, sort packets on arrival directly into the output queue.

In comparison with scheduling approaches 1 and 2 above, this would appear to be much more difficult. Not only
must calculations be made at wire speed for each packet prior to enqueueing, but packets must also be inserted into a
potentially huge ordered data structure.

However, this approach enables parallelism to be exploited in the implementation of the solution. When
performance can no longer be improved through brute force in a serialised solution, the way forward is to find an approach
which can scale up through its parallelism.

The "Programmable 40 Gbits/s Traffic Handler" proposal describes an innovative parallel processing architecture
which can provide a sufficient number of processing cycles per packet to enable the finish number calculation to
be made at wire speed for packets as they arrive from the switch fabric. This proposal describes a solution to the
other half of the problem - the maintenance of a large orderlist at high speed.

Description of the concept and invention 
Figure 1 shows the basic concept of bin sorting: 

• Consider a small set of bins. Each bin is used to contain packets with a certain range of finish numbers. 
The content of a bin is not ordered. 

• A function is required which receives packets and places them in the appropriate bin. 

• Another function is required which reads the content of each bin in turn in ascending order of the finish
number ranges.

• Assume that just as bins are emptied at one end of this sequence, new bins are installed at the other as
packets arrive with finish numbers which are, on average, constantly increasing in value.

• Thus, a stream of packets is arriving with randomly varying finish numbers. These are sorted into bins. A
stream of packets is output in a coarsely sorted order which depends on bin size.

• The final stage bins can relatively easily be sorted into actual order for output. 
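
The bin sorting concept listed above can be sketched as a sliding window of bins. This is a minimal model and not the proposed implementation: `BinSorter`, `bin_width` and `num_bins` are hypothetical names, bins are plain lists, and the final in-bin ordering is done with an ordinary sort.

```python
from collections import deque


class BinSorter:
    """Sliding window of unordered bins keyed by finish number range."""

    def __init__(self, bin_width, num_bins):
        self.bin_width = bin_width
        self.base = 0  # lowest finish number covered by the first bin
        self.bins = deque([] for _ in range(num_bins))

    def insert(self, finish, packet):
        # Place the packet in the bin whose range contains its finish number.
        index = (finish - self.base) // self.bin_width
        if not 0 <= index < len(self.bins):
            raise ValueError("finish number outside the current bin window")
        self.bins[index].append((finish, packet))

    def pop_bin(self):
        # Empty the lowest bin and install a new empty bin at the far end.
        contents = self.bins.popleft()
        self.bins.append([])
        self.base += self.bin_width
        # Final stage: a small bin is cheap to put into exact order.
        return sorted(contents, key=lambda item: item[0])
```

Insertion is a constant-time index calculation regardless of how many packets are held, which is what makes the enqueue side suitable for parallel processors; only the small final bin needs exact ordering.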

Figure 2 shows an approach that applies this concept. The numbering shows the sequence of events as packets
arrive, state is accessed, and packets are binned etc. (Full walkthrough could be provided if necessary.)

• Each bin could be implemented as a FIFO queue or LIFO stack in memory. Such data structures may be 
managed by pointers which locate the insertion and removal points for data to/from the structure. 

• The functions that operate on the bins need access to these pointers. The functions could be mapped into
processors and the pointers into a state memory.

• A data structure is proposed which comprises more than one set of bins. Within a set of bins the finish
number range is constant, but between sets the ranges get progressively smaller. Bins with the widest
range have the largest finish number values and bins with the smallest range have the smallest finish
number values. For example, the total range of finish numbers across all bins in one set may equate to the
range of a single bin in an adjacent set.

• When a bin is emptied, it is sorted into the next set of bins.

• Either this is repeated until the finish number range of the final set of bins is unity, OR when the smallest
bins are emptied they are subject to a final sort before forwarding in order.
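
The iterative binning described above, in which an emptied bin is re-sorted into a finer set of bins until the range reaches unity, can be sketched as a recursive function. This is a static illustration over a fixed list rather than a live packet stream, and `iterative_bin_sort`, `low`, `width` and `fanout` are hypothetical names.

```python
def iterative_bin_sort(items, low, width, fanout):
    """Sort (finish, packet) pairs by repeated binning.

    Assumes low <= finish < low + width for every item and fanout >= 2.
    """
    if width <= 1 or len(items) <= 1:
        return list(items)  # range is unity, or nothing left to order

    bin_width = max(1, width // fanout)
    num_bins = -(-width // bin_width)  # ceiling division
    bins = [[] for _ in range(num_bins)]
    for finish, packet in items:
        bins[(finish - low) // bin_width].append((finish, packet))

    ordered = []
    for i, contents in enumerate(bins):
        # Emptying a bin sorts its content into the next, finer set of bins.
        ordered.extend(
            iterative_bin_sort(contents, low + i * bin_width, bin_width, fanout)
        )
    return ordered
```

With `width = 16` and `fanout = 4`, for example, the first set has four bins each spanning four finish numbers, and each of those is re-binned into four bins of range one.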
Details of the embodiment

... orderlist manager. The numbering shows the ... (Full walkthrough could be provided if necessary.)

• ... architecture they are well suited to the ...

• State Engines used as hardware accelerators can enable the MTAP process ...

• ... logical state required for the bins. This minimises the required state per bin, ...

• ... data from the requested bin.

• ... producer to extract the linked list ...

• ... make this serialised process efficient.

• ... Small records which ... Bins now store small entities (records) of fixed size.



Additional related design work

... balancing: The MTAP processors are ... in order that they can cope with the transient worst case rate ...
enqueuing (scheduling) task and the ... (fine sort) task. A sufficient number of ... rate is much lower. This would
mean that a number of processors could ... or dequeueing tasks. The remainder may float. The proposal is that a ...

... for subsequent reporting to the control plane.

... I felt it might be sufficiently ... the structure, and underlying memory management to efficiently store the
structure ... needs functions to read and write entries ...

No mention has yet been made of maintaining a ... available memory and allocating memory for the data ...
The efficiency of the orderlist manager ... system which efficiently manip... place of packets they represent.
... for processing.

• Most significantly, the record will contain the memory address of the (first) block of memory in which the
packet is stored.

Packet record handling and storage ... This will require the existence of ... records themselves.

• Bin memory management concept ... MTAPs respectively. Bin memory ... memory provided for packets.

... record memory must also then be free. ... memory block occupied by ... and recovered, the system is very robust.

... records A and B which are adjacent in a linked list, ... also has 'Self_B' and 'Next_B'. It can be seen that ...

• ... only the "Next" pointer is ... before it in the list. This ... nested within the write/read accesses to the
other. ... interchangeable with a linked list ... reference [3] for further details.
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Reference material 

[1] ClearSpeed "Network Processor architecture document - V1.0"

- Architecture proposal for ClearSpeed 40G Network Processor
[2] A. Spencer "Traffic Handler architecture document - V0.1"

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress)
[3] K. Cameron "Software Queueing for Traffic Management"

- Discussion document

6.6 Key features of the invention 

• A single orderlist is used instead of multiple queues.

• A method of iterative binning is described which makes the management of large orderlists very efficient.

• Orderlist management is performed entirely in software using MTAP processors and accompanying state
engines.

• The processing and state resource can be partitioned to provide either single or multiple orderlists.



6.7 Scope of the claim 

The claim is considered to be original on two fronts: 

Firstly, it is an alternate way of queueing data. Packets (or packet records) are queued directly into an orderlist
instead of in separate per-stream queues.

Secondly, it is unlikely that MTAP processors have been used previously to perform the queue management,
either in terms of the process of making the binning decisions, or in the use of state engines for bin pointer
management.

Because the invention relies on binning and data structure management in software, it is also speculated that
alternative data structures could be mapped into the hardware resources and managed by different software
programs. ... the invention could have broader application beyond that of supporting Traffic Handling.
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7. Overlapped virtual queueing 
7.1 Background 

Some prior understanding of Traffic Management, Traffic Handling and (virtual) queueing would be a benefit.

... assigned from a common pool to a stream or ... is effectively inactive - an unused resource which should ...
the traffic handling system.

... basis. A queue is only assigned to a connection when that connection has backlogged traffic.



7.2 The problem and prior art 

The implementation of virtual queueing presents its own problems: 

1. How are queues deassigned? This could be tricky if there are a number of points in the handler at which
packets belonging to a given connection could be buffered.

2. How are queues assigned to new connections? Packets belonging to new connections could appear and
make an on the spot request for a queue.

3. The purge that is necessary in between a queue being deassigned and it being assigned to a new
connection requires significant system wide messaging and state synchronisation.

This last issue is the core focus of the ClearSpeed "Overlapped Virtual Queueing" concept.

Consider the simple high level view of traffic handling behaviour illustrated in figure 1.

... it is served.

There is a close relationship between the information held in B and the organisation of state in D. In the context 
of virtual queueing these relationships are more specifically: 

• Lookup entries in B must be created when packets with unrecognised stream labels arrive at A. These
entries must point to an available Q-state entry in D.

• Based on accesses from both C and F, D must be capable of determining (a) whether a queue is empty,
and (b) whether the queue is eligible to be de-assigned and returned to the pool in B.

• When D de-assigns a queue, the related entry in B must be removed.

... in buffers between A, C and F. The virtual queueing solution must also impose minimal overhead on the system
in terms of logical complexity, messaging bandwidth and the storage of additional state.

7.3 Summary of the invention 

A low overhead method for setting up and tearing down virtual queues 

7.4 List of attached figures 




Figure 7.1 Simple schematic view of traffic management 
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Figure 7.2 Queue lookup table 



7.5 Detailed description 

Introduction 

Conventionally, one might deassign a virtual queue from an inactive connection, wait for any residual packets for
that connection to be purged from the queues, and then reassign the queue to a new connection. This is possible
but can require a lot of control signalling, additional state and synchronisation across the system. Overlapped
virtual queueing eliminates the purge phase and makes deassignment and reassignment simultaneous.

Instead of a queue being either assigned or unassigned to a connection, it is either assigned or pending
reassignment. Only after boot-up might a queue actually be unassigned. This means that a queue normally always belongs
to a connection.

In the pending reassignment state it may still be used by the old connection (that is, assuming the previously in- 
active connection suddenly comes back to life!). 

At the moment of reassignment a new connection takes ownership of the queue and the old connection may no
longer place any further packets in it. The old connection will be granted a new queue as and when further packets
arrive.

Deassignment is thus implicit in reassignment - there is no explicit messaging involved.
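
The assigned / pending reassignment behaviour described above can be modelled in a few lines. This sketch covers only the ownership bookkeeping, not the handling of residual packets, and every name in it (`OverlappedQueueAllocator`, `mark_pending`, `queue_for`) is hypothetical.

```python
class OverlappedQueueAllocator:
    """Queues are either assigned or pending reassignment (sketch only)."""

    def __init__(self, num_queues):
        self.owner = [None] * num_queues        # None only after boot-up
        self.by_connection = {}                 # connection -> queue id
        self.pending = list(range(num_queues))  # queues open to reassignment

    def mark_pending(self, qid):
        # Called when a monitoring function decides the queue has gone idle.
        if qid not in self.pending:
            self.pending.append(qid)

    def queue_for(self, connection):
        qid = self.by_connection.get(connection)
        if qid is not None:
            # The old connection came back to life: it still owns the queue.
            if qid in self.pending:
                self.pending.remove(qid)
            return qid
        if not self.pending:
            raise RuntimeError("no queue is pending reassignment")
        qid = self.pending.pop(0)
        old = self.owner[qid]
        if old is not None:
            # Deassignment is implicit: the old mapping simply disappears.
            del self.by_connection[old]
        self.owner[qid] = connection
        self.by_connection[connection] = qid
        return qid
```

Note that there is no separate "deassign" call on the old connection: taking a pending queue for a new connection is the only operation that removes the old ownership.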
Instead of purging the residual packets ... the traffic handling pipeline, those packets are left alone ...
new connection, with old packets ... when they are dequeued.



Description of the concept and invention

Queue lookup:

... is illustrated in figure 2 and described as follows.

The table can be indexed in one of two ways:

1. An entry is identified by content addressing using the key.

2. The value (of length N bits) can be used to directly index the table of 2^N entries.

The required behaviour uses these addressing modes as follows:

• A key is presented to the table in order to look up ... data. An exact match is required.

• An unrelated value is attached to each key used for the lookup. If the lookup is successful then the
attached value is returned with the results. ... in the additional data field. ... to pool when the result is returned.
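
The two addressing modes of the lookup table (content addressing by key, and direct indexing by an N-bit value) can be shown with a small model. Only the two modes themselves come from the text; the class name `QueueLookupTable`, its method names and the linear search standing in for a content addressable memory are assumptions.

```python
class QueueLookupTable:
    """Table of 2**N entries with two addressing modes (illustrative)."""

    def __init__(self, n_bits):
        self.keys = [None] * (1 << n_bits)

    def find(self, key):
        # Mode 1: content addressing. An exact match on the key is required.
        # A linear search stands in for a CAM lookup here.
        for index, stored in enumerate(self.keys):
            if stored == key:
                return index
        return None

    def write(self, value, key):
        # Mode 2: the N-bit value directly indexes the table.
        self.keys[value] = key

    def read(self, value):
        return self.keys[value]
```

Writing a new key at a directly indexed entry overwrites whatever key was stored there, so the previous key no longer matches on a content lookup.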

State management:

In order to support the queue lookup block, an additional function ...
deassign idle queues, and pass their identities to the queue lookup's pool.

This function is most obviously associated with the ...
the embodiment section next.
... which are buffered between the queue lookup ... deassignment. It must mark that ... queue monitoring function
re-categorises an idle queue as pending ... queue's state. Some time later, the first packet of a ... a
connection_status flag set. This provides two clear reference points for the ... enqueueing logic to work ... the limbo
period ... state functions can recognise and conditionally handle any packet belonging to ... the queue monitoring ...
enqueuing function may clear the ... residual packets of the old connection.

2. Differentiating between packets of the new connection ... connection which are backlogged in the queue
structure ... is possible ... and the dequeueing function sends a ... and new packets.

Finally, note that if a ...

For the queue assignment:

• Content Addressable Memory is an ideal memory technology for the table memory.

• ... memory, supported by a function which provides and recycles the free queue identities.

Reference material 

[1] A. Spencer "Traffic Handler architecture document - V0.1"

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress) 

7.6 Key features of the invention

• ... assigned or pending re-assignment. ... connection use by newly arriving packets ...

• ... measured timeout period.



7.7 Scope of the claim 
This proposal is only relevant to Traffic Handling applications in which many (hundreds of thousands of) queues
are required. Specifically, it therefore relates to per-flow Traffic Handling.
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